  temp weekday cost price sales
1 17.3       6  1.5   5.6   173
2 25.4       3  0.3   4.9   196
3 23.3       5  1.5   7.6   207
4 26.9       1  0.3   5.3   241
5 20.2       1  1.0   7.2   227
6 26.1       6  0.5   6.6   193
Causal Inference and Machine Learning.
Code: download the Zip and open double-ml.qmd in RStudio.
Introduction
Inference
Causality
Causal Inference
Machine Learning
Double Machine Learning / Debiased Machine Learning
Your Turn!
Conclusion
Inference means that we are attempting to learn about (i.e., estimate) the value of a population parameter that we can’t observe directly.
For example, we might estimate \(\hat{\beta}\) using OLS, our estimator for \(\beta\) in the population, where \(\beta\) describes the relationship (i.e., slope) between \(X\) and \(Y\).
Causality is not the same as inference.
We can infer values that are non-causal.
Causality is all about your causal identification strategy.
Your identification strategy is how you justify making causal claims given your research design.
What does it even mean to “cause” something?
Causal inference is the combination of an identification strategy and statistical inference.
It requires:
Convincing your reader that your estimand matches your theory.
Convincing your reader that your estimator captures your estimand.
Convincing your reader that the relationship represented by your estimand is causal.
Convincing your reader that your estimator is not biased by confounding or collider bias.
Inference: this is the statistical task of using a sample to learn about a population.
Causal inference: this is the research design task of convincing readers that the parameters you’re making inferences about represent causal relationships and not simply spurious correlations.
The goal of machine learning is to learn (estimate) a function of interest given the data that go into and come out of the function.
Think of functions like we do in mathematics: \(f(x) \rightarrow y\)
\(x\) goes in and \(y\) comes out.
In most math courses (e.g., algebra), we know \(x\) and we know \(f(\cdot)\); we’re solving for \(y\).
In machine learning, we know \(x\) and we know \(y\); we’re solving for \(f(\cdot)\).
We use machine learning for two primary purposes:
OLS is great for inference:
\(\beta\) is easy to interpret.
“A one unit increase in \(x_i\) corresponds to an expected \(\beta_i\) increase in \(y\), holding all other \(x_j\) constant.”
OLS is not great for prediction:
Often, no reason to expect constant linear relationships between \(X\) and \(Y\).
OLS chokes on problems with many independent variables.
OLS assumes a very strict functional form and interactions or transformations must be input manually.
We often don’t know how \(X\) and \(Y\) are related.
Most ML methods make weaker assumptions about \(f(\cdot)\).
Many ML models can learn transformations, variable selection, and interactions from the data!
But, this comes at a cost:
Complicated (or even unknown) functional forms.
Often no standard errors, which we need for inference.
Often ML is great for prediction, bad for inference.
OLS Linear Regression
\(y = \alpha + \beta x + \beta_{1}z_{1} + \ldots + \beta_k z_k + u\)
\(y\): dependent variable (outcome)
\(x\): independent variable of interest (treatment)
\(z_j\): control variables / confounders (not interesting)
OLS Linear Regression
\(y = \alpha + \beta x + \beta_{1}z_{1} + \ldots + \beta_k z_k + u\)
Can we have the best of both worlds?
What if we could assume a linear relationship between \(x\) and \(y\)…
…and make fewer assumptions about the relationship between \(z\) and \(y\)?
In double machine learning, we can control for many non-linear confounds \((z)\) using any machine learning model.
Then, we can use a standard regression model to estimate the effect of \(x\) on \(y\).
Predictive performance and flexibility of machine learning!
Interpretability and uncertainty of traditional statistical inference!
Call:
lm(formula = sales ~ price + weekday, data = data)
Residuals:
Min 1Q Median 3Q Max
-61.943 -13.669 -2.966 11.574 60.076
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 192.5306 1.0815 178.019 < 2e-16 ***
price 1.2285 0.1623 7.570 4.06e-14 ***
weekday 0.1111 0.0960 1.158 0.247
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 19.37 on 9997 degrees of freedom
Multiple R-squared: 0.00584, Adjusted R-squared: 0.005641
F-statistic: 29.36 on 2 and 9997 DF, p-value: 1.928e-13
# Stage 1: residualize the outcome (sales) on the confounder (weekday)
y_on_z <- lm(sales ~ weekday, data = data)
# Stage 2: residualize the treatment (price) on the confounder
x_on_z <- lm(price ~ weekday, data = data)
# Stage 3: regress the outcome residuals on the treatment residuals
y_on_x <- lm(y_on_z$resid ~ x_on_z$resid)
summary(y_on_x)
Call:
lm(formula = y_on_z$resid ~ x_on_z$resid)
Residuals:
Min 1Q Median 3Q Max
-61.943 -13.669 -2.966 11.574 60.076
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.936e-15 1.937e-01 0.00 1
x_on_z$resid 1.229e+00 1.623e-01 7.57 4.05e-14 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 19.37 on 9998 degrees of freedom
Multiple R-squared: 0.0057, Adjusted R-squared: 0.0056
F-statistic: 57.31 on 1 and 9998 DF, p-value: 4.049e-14
Standard Linear Regression
Estimate Std. Error t value Pr(>|t|)
(Intercept) 192.530636 1.08151586 178.019245 0.000000e+00
price 1.228518 0.16228712 7.570028 4.061027e-14
weekday 0.111129 0.09599697 1.157630 2.470427e-01
Frisch-Waugh-Lovell
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.936429e-15 0.1936713 9.998533e-15 1.000000e+00
x_on_z$resid 1.228518e+00 0.1622790 7.570407e+00 4.049241e-14
Original OLS:
price 1.228518 0.16228712 7.570028 4.061027e-14
Frisch-Waugh-Lovell regression:
x_on_z$resid 1.228518 0.1622790 7.570407 4.049241e-14
To estimate the treatment effect of \(x\) on \(y\) in the presence of confounders \(z\), we could:
Regress \(y\) on \(z\) and save the residuals.
Regress \(x\) on \(z\) and save the residuals.
Regress the \(y\) residuals on the \(x\) residuals.
We don’t have to use OLS for the first (or second) stage!
We can use any method we like to control for the confounding variables.
This includes methods that are much more flexible than OLS.
You may need to run the following code in your Console to install the randomForest package:
install.packages("randomForest")
Load the randomForest package:
library(randomForest)
First, let’s regress \(y\) (sales) on \(z\) (weekday), our confounder.
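The output below comes from the final-stage regression. A sketch of the intervening random forest steps, using the `data` object from above; the residuals are assumed to be computed as observed values minus the forest’s (out-of-bag) predictions, and default hyperparameters are an assumption:

```r
library(randomForest)

# Residualize the outcome: random forest of sales on the confounder weekday.
# predict() with no newdata returns out-of-bag predictions for the training data.
rf_sales_on_weekday <- randomForest(sales ~ weekday, data = data)
rf_sales_on_weekday_resid <- data$sales - predict(rf_sales_on_weekday)

# Residualize the treatment: random forest of price on weekday
rf_price_on_weekday <- randomForest(price ~ weekday, data = data)
rf_price_on_weekday_resid <- data$price - predict(rf_price_on_weekday)

# Final stage: OLS of residualized sales on residualized price
summary(lm(rf_sales_on_weekday_resid ~ rf_price_on_weekday_resid))
```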
Call:
lm(formula = rf_sales_on_weekday_resid ~ rf_price_on_weekday_resid)
Residuals:
Min 1Q Median 3Q Max
-60.385 -8.754 0.257 8.887 57.115
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.00499 0.13320 0.037 0.97
rf_price_on_weekday_resid -3.45619 0.12004 -28.792 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 13.32 on 9998 degrees of freedom
Multiple R-squared: 0.07657, Adjusted R-squared: 0.07647
F-statistic: 829 on 1 and 9998 DF, p-value: < 2.2e-16
x y z
Min. :-1.2236 Min. : 5.821 Min. :-5.9762
1st Qu.: 0.6654 1st Qu.: 7.994 1st Qu.:-1.5420
Median : 1.1981 Median : 9.635 Median :-0.1643
Mean : 1.1889 Mean :11.955 Mean :-0.1116
3rd Qu.: 1.6725 3rd Qu.:13.722 3rd Qu.: 1.3245
Max. : 3.5069 Max. :45.408 Max. : 6.1052
Estimate an OLS linear regression of \(y = \alpha + \beta_1 x + \beta_2 z + u\).
What is the estimated effect of \(x\) on \(y\)?
Estimate a random forest model regressing \(y\) on \(z\). Compute the residuals and store them in a vector called y_on_z_resid.
Estimate a random forest model regressing \(x\) on \(z\). Compute the residuals and store them in a vector called x_on_z_resid.
Estimate a linear model of y_on_z_resid ~ x_on_z_resid using the lm(...) function.
Summarize your model and determine the estimated effect of x on y.
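One possible solution sketch for the steps above, assuming the exercise data live in a data frame called df (a hypothetical name) with columns x, y, and z:

```r
library(randomForest)

# 1. OLS benchmark: y on x, controlling linearly for z
summary(lm(y ~ x + z, data = df))

# 2. Residualize y on z with a random forest (out-of-bag predictions)
rf_y_on_z <- randomForest(y ~ z, data = df)
y_on_z_resid <- df$y - predict(rf_y_on_z)

# 3. Residualize x on z with a random forest
rf_x_on_z <- randomForest(x ~ z, data = df)
x_on_z_resid <- df$x - predict(rf_x_on_z)

# 4. Final stage: the coefficient on x_on_z_resid is the estimated effect
summary(lm(y_on_z_resid ~ x_on_z_resid))
```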
You could overfit:
Machine learning algorithms can sometimes fit the data too well.
This would cause your residuals to have low variance.
This could cause you to underestimate your effect or standard errors.
The solution is to use k-fold cross-prediction:
Estimate your ML algorithms on partitions of the data.
Then, predict values out-of-sample to use in your third stage.
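The two steps above can be sketched as cross-fitting, reusing the sales/price/weekday example from earlier; the number of folds and the variable names here are illustrative assumptions:

```r
library(randomForest)

K <- 5  # number of folds (an assumption)
folds <- sample(rep(1:K, length.out = nrow(data)))

y_resid <- numeric(nrow(data))
x_resid <- numeric(nrow(data))

for (k in 1:K) {
  train <- data[folds != k, ]
  held  <- data[folds == k, ]

  # Fit the nuisance models on the training folds only
  rf_y <- randomForest(sales ~ weekday, data = train)
  rf_x <- randomForest(price ~ weekday, data = train)

  # Predict out-of-sample on the held-out fold
  y_resid[folds == k] <- held$sales - predict(rf_y, newdata = held)
  x_resid[folds == k] <- held$price - predict(rf_x, newdata = held)
}

# Third stage: OLS on the cross-fitted residuals
summary(lm(y_resid ~ x_resid))
```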
The Frisch-Waugh-Lovell Theorem says: residualizing \(y\) and \(x\) on the controls \(z\) with OLS, then regressing the \(y\) residuals on the \(x\) residuals, recovers exactly the same \(\hat{\beta}\) on \(x\) as the full multiple regression.
Double Machine Learning says: we can do that residualizing with flexible ML models instead of OLS (using cross-fitting to avoid overfitting) and still conduct valid inference on \(\hat{\beta}\).
Here, we get the benefits of flexible ML algorithms for the control variables and the interpretability of OLS for the treatment variable.
Chernozhukov et al. 2018. “Double/Debiased Machine Learning for Treatment and Structural Parameters.” The Econometrics Journal 21(1). https://doi.org/10.1111/ectj.12097.